#### CS250P: Computer Systems Architecture Explicit Parallelism



Sang-Woo Jun Fall 2022



Large amount of material adapted from MIT 6.004, "Computation Structures", Morgan Kaufmann "Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition", and CS 152 Slides by Isaac Scherson

## Modern Processor Topics - Performance

#### □ Transparent Performance Improvements

- Pipelining, Caches
- Superscalar, Out-of-Order, Branch Prediction, Speculation, ...
- $\circ~$  Covered in CS250A and others
- □ Explicit Performance Improvements
  - SIMD extensions, AES extensions, ...

0 ...



## **SIMD** operations

- □ Single ISA instruction performs same computation on multiple data
- Typically implemented with special, wider registers

#### **D** Example operation:

- $\circ~$  Load 32 bytes from memory to special register X
- $\circ~$  Load 32 bytes from memory to special register Y
- $\circ~$  Perform addition between each 4-byte value in X and each 4 byte value in Y

For i in (0 to 7): Z[i] = X[i] + Y[i];

- Store the four results in special register Z
- Store Z to memory
- □ RISC-V SIMD extensions (P) is still being worked on (as of 2021)

## Example: Intel SIMD Extensions

- □ More transistors (Moore's law) but no faster clock, no more ILP...
  - More capabilities per processor has to be explicit!
- □ New instructions, new registers
  - Must be used explicitly by programmer or compiler!
- □ Introduced in phases/groups of functionality
  - SSE SSE4 (1999 –2006)
    - 128 bit width operations
  - AVX, FMA, AVX2, AVX-512 (2008 2019)
    - 256 512 bit width operations
  - o F16C, and more to come?



#### Aside: Do I Have SIMD Capabilities?

#### □ less /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p se36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm con stant\_tsc art arch\_perfmon pebs bts rep\_good nopl xtopology nonstop\_tsc cpuid aperfmp erf tsc\_known\_freg pni pclmulgdq dtes64 monitor ds\_cpl vmx est tm2 ssse3 sdbg fma\_cx1 6 xtpr pdcm pcid\_sse4\_1 sse4\_2 x2apic movbe popcnt tsc\_deadline\_timer aes xsave avx f 16c rdrand lahf\_lm abm 3dnowprefetch cpuid\_fault epb invpcid\_single pti ssbd ibrs ibp b stibp tpr\_shadow vnmi flexpriority ept vpid fsgsbase tsc\_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel\_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp\_notify hwp\_act\_window hwp\_epp flush\_l1d

## Intel SIMD Registers (AVX-512)



**ZMM31** 

□ XMM0 – XMM15

128-bit registers

o SSE

- □ YMM0 YMM15
  - 256-bit registers
  - $\circ$  AVX, AVX2
- ZMM0 ZMM31
  - 512-bit registers

• AVX-512

## SSE/AVX Data Types

| 25 | 55   | 0 |
|----|------|---|
|    | YMM0 |   |

|        | float float |   |   | float float |       |    |        |   |       |   |   |   | float float |   |        |   |       |   |   | flo   | at |   | float  |       |   |   |   |       |   |   |   |   |    |
|--------|-------------|---|---|-------------|-------|----|--------|---|-------|---|---|---|-------------|---|--------|---|-------|---|---|-------|----|---|--------|-------|---|---|---|-------|---|---|---|---|----|
| double |             |   |   |             |       |    | double |   |       |   |   |   |             |   | double |   |       |   |   |       |    |   | double |       |   |   |   |       |   |   |   |   |    |
|        | int32       |   |   |             | int32 |    |        |   | int32 |   |   |   | int32       |   |        |   | int32 |   |   | int32 |    |   |        | int32 |   |   |   | int32 |   |   |   |   |    |
|        | 1           | 6 | 1 | 6           | 1     | .6 | 1      | 6 | 1     | 6 | 1 | 6 | 1           | 6 | 1      | 6 | 1     | 6 | 1 | 6     | 1  | 6 | 1      | 6     | 1 | 6 | 1 | 6     | 1 | 6 | 1 | 6 |    |
|        | 8           | 8 | 8 | 8           | 8     | 8  | 8      | 8 | 8     | 8 | 8 | 8 | 8           | 8 | 8      | 8 | 8     | 8 | 8 | 8     | 8  | 8 | 8      | 8     | 8 | 8 | 8 | 8     | 8 | 8 | 8 | 8 | 0p |

Operation on 32 8-bit values in one instruction!

#### **Compiler Automatic Vectorization**

In gcc, flags "-O3 -mavx -mavx2" attempts automatic vectorization
 Works pretty well for simple loops

.L2:

```
int a[256], b[256], c[256];
void foo () {
   for (int i=0; i<256; i++) a[i] = b[i] * c[i];
}
```

```
vmovdqa xmm1, XMMWORD PTR b[rax]
add rax, 16
vpmulld xmm0, xmm1, XMMWORD PTR c[rax-16]
vmovaps XMMWORD PTR a[rax-16], xmm0
cmp rax, 1024
jne .L2
```

But not for anything complex

○ E.g., naïve bubblesort code not parallelized at all

Generated using GCC explorer: <u>https://gcc.godbolt.org/</u>

#### Intel SIMD Intrinsics

□ Use C functions instead of inline assembly to call AVX instructions

□ Compiler manages registers, etc

Intel Intrinsics Guide

- o <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide">https://software.intel.com/sites/landingpage/IntrinsicsGuide</a>
- $\circ~$  One of my most-visited pages...

e.g., \_\_m256 a, b, c; \_\_m256 d = \_mm256\_fmadd\_ps(a, b, c); // d[i] = a[i]\*b[i]+c[i] for i = 0 ...7

#### **Intrinsic Naming Convention**

#### \_mm<width>\_[function]\_[type]

 E.g., \_mm256\_fmadd\_ps : perform fmadd (floating point multiply-add) on 256 bits of packed single-precision floating point values (8 of them)

| Width | Prefix  |
|-------|---------|
| 128   | _mm_    |
| 256   | _mm256_ |
| 512   | _mm512_ |

Not all permutations exist! Check guide

| Туре                    | Postfix                |
|-------------------------|------------------------|
| Single precision        | _ps                    |
| Double precision        | _pd                    |
| Packed signed integer   | _epiNNN (e.g., epi256) |
| Packed unsigned integer | _epuNNN (e.g., epu256) |
| Scalar integer          | _siNNN (e.g., si256)   |

#### **Example: Vertical Vector Instructions**

#### □ Add/Subtract/Multiply

- o \_mm256\_add/sub/mul/div\_ps/pd/epi
  - Mul only supported for epi32/epu32/ps/pd
  - Div only supported for ps/pd
  - Consult the guide!
- □ Max/Min/GreaterThan/Equals
- □ Sqrt, Reciprocal, Shift, etc...
- □ FMA (Fused Multiply-Add)
  - (a\*b)+c, -(a\*b)-c, -(a\*b)+c, and other permutations!
  - o Consult the guide!



\_\_\_m256 a, b, c; \_\_\_m256 d = \_\_mm256\_fmadd\_\_pd(a, b, c);

#### Horizontal Vector Instructions

#### □ Horizontal add/subtraction

- $\circ~$  Adds adjacent pairs of values
- E.g., \_\_m256d \_mm256\_hadd\_pd (\_\_m256d a, \_\_m256d b)



## Shuffling/Permutation

#### Within 128-bit lanes

- \_mm256\_shuffle\_ps/pd/... (a,b, imm8)
- o \_mm256\_permute\_ps/pd
- o \_mm256\_permutevar\_ps/...
- Across 128-bit lanes
  - o \_mm256\_permute2x128/4x64 : Uses 8 bit control
  - \_mm256\_permutevar8x32/... : Uses 256 bit control
- Not all type permutations exist for each type, but variables can be cast back and forth between types

 $es = _mm256_permute_ps(vec, 0b01110100)$ 



Matt Scarpino, "Crunching Numbers with AVX and AVX2," 2016

#### Blend

Merges two vectors using a control

- o \_mm256\_blend\_...: Uses 8 bit control
  - e.g., \_mm256\_blend\_epi32
- \_mm256\_blendv\_... : Uses 256 bit control
  - e.g., \_mm256\_blendv\_epi8



## Alignr

□ Right-shifts concatenated value of two registers, by byte

- $\circ~$  Often used to implement circular shift by using two same register inputs
- \_mm256\_alignr\_epi8 (a, b, count)

Example of 64-bit values being shifted by 8



#### Helper Instructions

#### Cast

- o \_\_\_mm256i <-> \_\_\_mm256, etc...
- $\,\circ\,\,$  Syntactic sugar -- does not spend cycles
- Convert
  - 4 floats <-> 4 doubles, etc...
- Movemask
  - o \_\_\_mm256 mask to -> int imm8
- □ And many more...

## Case Study: Matrix Multiplication

□ Remember simply transposing matrix B brought 6x performance

- At that point, we are bottlenecked by single-thread processing performance
- $\circ~$  Adding SIMD gets us more!
- $\circ~$  After this we are again bottlenecked by memory, but that is for another time

VS







BT

63.19 seconds

2020 Secondds (29 x peformanae) !)

X

## Case Study: Sorting

- □ Important, fundamental application!
- □ Can be parallelized via divide-and-conquer
- □ How can SIMD help?

## The Two Register Merge

Sort units of two pre-sorted registers, K elements
 minv = A, maxv = B

#### $\circ$ // Repeat K times

- minv = min(minv,maxv)
- maxv = max(minv,maxv)
- // circular shift one value down
- minv = alignr(minv, minv, sizeof(int))



#### Inoue et.al., "SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures," VLDB 2015

## SIMD And Merge Sort

- Hierarchically merged sorted subsections
- **Using the SIMD merger for sorting** 
  - vector\_merge is the two-register sorter from before

```
aPos = bPos = outPos = 0;
vMin = va[aPos++];
vMax = vb[bPos++];
while (aPos < aEnd && bPos < bEnd) {
   /* merge vMin and vMax */
   vector_merge(vMin, vMax);
```

```
/* store the smaller vector as output*/
vMergedArray[outPos++] = vMin;
```

```
/* load next vector and advance pointer */
/* a[aPos*4] is first element of va[aPos] */
/* and b[bPos*4] is that of vb[bPos] */
if (a[aPos*4] < b[bPos*4])
    vMin = va[aPos++];
else
    vMin = vb[bPos++];
}</pre>
```

Inoue et.al., "SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures," VLDB 2015

### **Topic Under Active Research!**

#### □ Papers being written about...

- $\circ~$  Architecture-optimized matrix transposition
- $\circ$  Register-level sorting algorithm
- Merge-sort
- $\circ \ \mbox{...}$  and more!

Good find can accelerate your application kernel Nx

### Processor Microarchitectural Effects on Power Efficiency

- □ The majority of power consumption of a CPU is not from the ALU
  - Cache management, data movement, decoding, and other infrastructure
  - $\circ~$  Adding a few more ALUs should not impact power consumption
- □ Indeed, 4X performance via AVX does not add 4X power consumption
  - From i7 4770K measurements:
  - Idle: 40 W
  - $\circ$  Under load : 117 W
  - $\circ~$  Under AVX load : 128 W



Michael Taylor, "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," 2012

## Very Long Instruction Word (VLIW)

□ Superscalar does not change the ISA

- Complicates hardware in charge of detecting dependencies!
- □ What if we changed the ISA, and made the compiler manage ILP?

#### □ Not in x86/RISC-V/ARM/...

- Sometimes as accelerator extensions!
- (RISC-V "V" extension)

## Very Long Instruction Word (VLIW)

- Multiple instructions packaged into a Very Long Instruction
   o Sometimes "bundle"
- □ Each execution operation slot has a fixed function (ALU, Mem, FP, etc)
- Compiler's responsibility to create efficient instructions
  - $\circ~$  Inter-slot dependency is not checked by hardware!



Krste Asanovic, CS152, Berkeley

# Intel Explicitly Parallel Instruction Computing (EPIC, Itanium)



#### **VLIW Characteristics**

- □ Very good performance for computation-intensive code
- □ Very bad performance for code with many dependencies/hazards!
  - $\circ~$  Much more sensitive to hazards than single-issue pipelines
  - Example: short loops



How many FP ops/cycle?

1 fadd / 8 cycles = 0.125

## Compiler's job is important!

#### e.g., Loop unrolling to keep execution units busy





How many FLOPS/cycle?

4 fadds / 11 cycles = 0.36 <sup>CS152-Spring'09</sup>

#### Krste Asanovic, CS152, Berkeley

#### Issues with VLIW

Execution unit configurations change across models

- How many Integer units, how many float units, neural units ...?
- Cannot be binary compatible across models!
  - Unless hardware provides an abstraction layer...?
  - But that would add scheduler overhead, undermining VLIW (Itanium tried a good balance)
- Dependency/hazards difficult for compiler to manage
  - Too many slots end up empty (low performance, large binary)

#### But when it works well, it works remarkably well

- $\circ$  e.g., Scientific computing
- That's why it is often resurrected as potential solution (Itanium, ATi TeraScale, ...)

#### CS250P: Computer Systems Architecture Out of Order Processing



Sang-Woo Jun Fall 2022



Large amount of material adapted from MIT 6.004, "Computation Structures", Morgan Kaufmann "Computer Organization and Design: The Hardware/Software Interface: RISC-V Edition", and CS 152 Slides by Isaac Scherson

#### **Back to Transparent Parallelism**

Explicit parallelism is not as popular as transparent

- Everyone wants performance for free!
- Can we keep execution slots busy, using backwards-compatible singlethread instruction streams?



Krste Asanovic, CS152, Berkeley

#### Skylake-X Microarchitecture (2019)



Anandtech

#### Apple M1 Microarchitecture (2020)



#### Anandtech

#### OoO: Determining dependencies



(13)

#### Data dependency types: RaW

- □ Read-after-Write (7)  $r5 \leftarrow MEM[r2]$ ○ A "true" dependency (9)  $r6 \leftarrow r4 + r5$ 
  - We must wait until r5's value is materialized... No other choice





## OoO managing dependencies

- Looks like dispatch+Commit stages added to VLIW
  - Instructions wait at "reservation stations"
  - $\circ~$  Listens to forwarding paths
    - "Is my input operand being written to"
  - $\circ~$  Forwarded to FU when ready
    - Out of order



### OoO managing dependencies

- Arithmetic can happen OoO, BUT Commits should happen In-Order!
  - Register writes, memory updates, etc
- Decoded instructions line up at Reorder Buffer(RoB)
  - Wait until execute results available
  - Wait until branch mispredict ruled out
  - $\circ$   $\,$  Commits in order of insertion  $\,$



### Many topics we won't go into today!

- Effectively matching available operands to waiting instructions
  - $\circ$   $\,$  Looping over instructions is too slow
  - N-to-N broadcast is too expensive (slow clocks!)
  - o Tomasulo's algorithm!

#### Precise interrupts become complicated

 Things are executing OoO, when a breakpoint happens, how do we line things back up for debugging?

#### Just one more topic: Register renaming

- □ Not all dependencies are RaW. Some can be resolved!
  - Write-after-Read (WaR)
    - e.g., 5->6
    - "Anti-dependence": r1's value clobbered after (6)
    - If we had used "r9" instead, no dependency!
  - Write-after-Write (WaW)
    - e.g., 7->7 across loop iteration
    - (7) does not read from "r5"...
    - If each loop iteration used a different reg, (r9, r10, r11,...) no dependency!

| (5)  | $r4 \leftarrow MEM[r1]$        |
|------|--------------------------------|
| (6)  | $r1 \leftarrow r1 + 4$         |
| (7)  | $r5 \leftarrow MEM[r2]$        |
| (8)  | $r2 \leftarrow r2 + 4$         |
| (9)  | $r6 \leftarrow r4 + r5$        |
| (10) | $\text{MEM}[r3] \leftarrow r6$ |
| (11) | $r3 \leftarrow r3 + 4$         |
| (12) | $r8 \leftarrow r8 - 1$         |
| (13) | bnz <mark>r8</mark> , LOOP     |

## OoO: Register renaming

#### □ Two different concepts of registers

- o "Architectural Registers": Conceptually defined in ISA, software abstraction
  - "RISC-V has 32 registers in the register file"
- "Physical Registers": Larger number of registers actually in silicon
  - Scheduler dynamically renames registers to an empty slot in the physical register file
  - When WaW or WaR dependencies are discovered



#### Register renaming: Previous example...



## Back to timely example: Apple M1

□ Really good single-thread performance!

- How?
  - "8-wide decoder" [...] "16 execution units (per core)"
  - "(Estimated) 630-deep out-of-order"

**RISC!** 

- "Unified memory architecture"
- Hardware/software optimized for each other



M1 Ultra Image source: wccftech

